Exploratory Data Analysis for Data Science and Machine Learning¶

IBM Guided Project¶

Importing Required Libraries¶

In [3]:
import random
import missingno as msno
import numpy as np
import pandas as pd

import seaborn as sns
sns.set_context('notebook') # Configures the aesthetics of the plots for jupyter notebook
sns.set_style('white') # Sets background style of plots to white

import matplotlib.pyplot as plt
%matplotlib inline
# ensures that inline plotting works correctly (newer versions of Jupyter Notebook do not need this)

from scipy.stats import shapiro

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from fasteda import fast_eda

Regression¶

The aim is to use regression to predict a numeric score indicating diabetes progression one year after blood pressure, BMI and blood sugar level are recorded.

Load the diabetes data set (from sklearn)¶

In [6]:
# About the data
print(load_diabetes()['DESCR'])
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [7]:
# Load the data from sklearn as two pandas.DataFrame - features (X) and target variable (y)
diabetes_X, diabetes_y = load_diabetes(return_X_y = True, as_frame = True, scaled = False)

#Renaming columns
diabetes_X.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
In [8]:
diabetes_X.head()
Out[8]:
age sex bmi bp tc ldl hdl tch ltg glu
0 59.0 2.0 32.1 101.0 157.0 93.2 38.0 4.0 4.8598 87.0
1 48.0 1.0 21.6 87.0 183.0 103.2 70.0 3.0 3.8918 69.0
2 72.0 2.0 30.5 93.0 156.0 93.6 41.0 4.0 4.6728 85.0
3 24.0 1.0 25.3 84.0 198.0 131.4 40.0 5.0 4.8903 89.0
4 50.0 1.0 23.0 101.0 192.0 125.4 52.0 4.0 4.2905 80.0
In [9]:
diabetes_y.head()
Out[9]:
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64
In [10]:
# Combine both diabetes_X (features) and diabetes_y (target) into one pandas.DataFrame
diabetes = pd.concat([diabetes_X, pd.Series(diabetes_y)], axis=1)

#Renaming the column with target value
diabetes.rename(columns={0: 'target'}, inplace=True)
In [11]:
# Looking into the data
diabetes.sample(5)
Out[11]:
age sex bmi bp tc ldl hdl tch ltg glu target
43 54.0 1.0 24.2 74.0 204.0 109.0 82.0 2.0 4.1744 109.0 92.0
156 44.0 1.0 25.4 95.0 162.0 92.6 53.0 3.0 4.4067 83.0 25.0
394 58.0 1.0 28.1 111.0 198.0 80.6 31.0 6.0 6.0684 93.0 273.0
88 34.0 2.0 22.6 75.0 166.0 91.8 60.0 3.0 4.2627 108.0 42.0
361 60.0 1.0 25.7 103.0 158.0 84.6 64.0 2.0 3.8501 97.0 182.0

Add some missing values¶

The original dataset contains no missing values, so for the sake of the EDA, missing values are introduced at random into 3 columns and 5% of the rows.
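As a minimal sketch of the same idea (illustrative only: a toy DataFrame and numpy's `Generator`, rather than the `random` module used in the notebook's own code):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the diabetes data
df = pd.DataFrame(np.arange(20.0).reshape(10, 2), columns=["a", "b"])

rng = np.random.default_rng(2024)  # seeded for reproducibility

# Pick 5% of the rows (at least one) and blank out column 'a' there
rows = rng.choice(df.index, size=max(1, len(df) // 20), replace=False)
df.loc[rows, "a"] = np.nan

print(df["a"].isna().sum())
```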

In [13]:
# Verifying that the data set has no missing values
diabetes.isna().max(axis=0).max()
Out[13]:
False
In [14]:
# Initializing the seed to 2024 so that the same random values are drawn each time the code is executed
random.seed(2024)

# Selecting 3 columns at random
missing_cols = random.sample(range(len(diabetes.columns)), 3)

# Selecting 5% of row index at random
missing_rows = random.sample(diabetes.index.tolist(), int(np.round(len(diabetes.index.tolist())/20)))

# Setting missing values to the randomly selected rows and columns
diabetes.iloc[missing_rows, missing_cols] = np.nan
In [15]:
# Having a look at the columns that were selected at random
print(diabetes.columns[missing_cols])
Index(['tch', 'bmi', 'tc'], dtype='object')
In [16]:
# Now verifying that the data set has missing values
diabetes.isna().max(axis=0).max()
Out[16]:
True

Initial Data Preprocessing¶

Note: In a typical workflow data preprocessing comes after conducting EDA

One-Hot Encoding¶

In the diabetes dataset, sex is encoded as 1 and 2 for female and male. This is not ideal for predictive models, as they may assume the column has some ordering to it. Hence we use one-hot encoding to create a separate binary column for each category of sex.
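As an aside, pandas' own `get_dummies` offers a lighter-weight sketch of the same encoding (the toy column below is illustrative; `OneHotEncoder` remains preferable inside sklearn pipelines because it can be fit on training data and reused):

```python
import pandas as pd

# Toy column with the same 1/2 coding as the diabetes 'sex' column
df = pd.DataFrame({"sex": [1, 2, 2, 1]})

# One binary column per category; drop_first=True keeps only one,
# avoiding the redundant column that would otherwise need dropping
encoded = pd.get_dummies(df["sex"], prefix="sex", drop_first=True, dtype=float)
print(encoded.columns.tolist())  # ['sex_2']
```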

In [19]:
# Initializing OneHotEncoder (ignore unknown categories in dataset, no categories are dropped)
enc1 = OneHotEncoder(handle_unknown='ignore', drop=None)

# One-hot encode 'sex'. 
# Double square brackets are used to ensure that the extracted sex data is in DataFrame format which is required by One-hot encoder
# The output from OneHotEncoder is a sparse matrix (stores only non-zero elements to save memory), which is converted to a numpy array
encoded_sex = enc1.fit_transform(diabetes[['sex']]).toarray()

# Convert numpy array to pandas DataFrame with column names corresponding to its sex category 
encoded_sex = pd.DataFrame(encoded_sex, columns=['sex' + str(int(x)) for x in enc1.categories_[0]])

# Horizontally concatenate the 'diabetes' and 'encoded_sex' DataFrames
diabetes = pd.concat([diabetes, encoded_sex], axis=1)

# Looking into the modified diabetes DataFrame
diabetes.sample(5)
Out[19]:
age sex bmi bp tc ldl hdl tch ltg glu target sex1 sex2
411 50.0 1.0 31.8 82.0 136.0 69.2 55.0 2.0 4.0775 85.0 136.0 1.0 0.0
78 50.0 1.0 21.0 88.0 140.0 71.8 35.0 4.0 5.1120 71.0 252.0 1.0 0.0
403 43.0 1.0 35.4 93.0 185.0 100.2 44.0 4.0 5.3181 101.0 275.0 1.0 0.0
264 58.0 2.0 29.0 85.0 156.0 109.2 36.0 4.0 3.9890 86.0 145.0 0.0 1.0
5 23.0 1.0 22.6 89.0 139.0 64.8 61.0 2.0 4.1897 68.0 97.0 1.0 0.0

From the above, sex is indicated through sex, sex1 and sex2, two of which are redundant; hence sex and sex2 can be dropped.

In [21]:
# Drop 'sex' and 'sex2' from diabetes DataFrame
diabetes = diabetes.drop(['sex', 'sex2'], axis=1)

# Rename 'sex1' to 'sex'
diabetes = diabetes.rename(columns={'sex1': 'sex'})

# Reorder renamed 'sex' to the previous 'sex' position
diabetes = diabetes.loc[:, ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu', 'target']]

# Looking into a sample of the modified diabetes DataFrame
diabetes.sample(5)
Out[21]:
age sex bmi bp tc ldl hdl tch ltg glu target
139 55.0 1.0 32.1 110.0 164.0 84.2 42.0 4.0 5.2417 90.0 281.0
307 67.0 0.0 23.5 96.0 207.0 138.2 42.0 5.0 4.8978 111.0 172.0
282 68.0 1.0 25.9 93.0 253.0 181.2 53.0 5.0 4.5433 98.0 230.0
433 41.0 1.0 20.8 86.0 223.0 128.2 83.0 3.0 4.0775 89.0 72.0
275 47.0 0.0 25.3 98.0 173.0 105.6 44.0 4.0 4.7622 108.0 94.0

Make a Train-Test Split¶

The code below randomly assigns 33% of the rows to the test set and the remaining 67% to the training set. The training set is used to train the predictive models, and the test set is the unseen data on which predictions are made.

In [23]:
# Make a Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.iloc[:,:-1], # Features data (all columns except the last)
    diabetes.iloc[:,-1], # Target data (last column)
    test_size=0.33, # 33% for testing
    random_state=2024 # for reproducibility
)

# `X_train` are the feature columns in the training set.
# `X_test` are the feature columns in the test set.
# `y_train` is the target column for the training set.
# `y_test` is the target column for the test set.

Perform EDA¶

A look at the beginning and end of the data set¶

In [26]:
diabetes.head()
Out[26]:
age sex bmi bp tc ldl hdl tch ltg glu target
0 59.0 0.0 32.1 101.0 157.0 93.2 38.0 4.0 4.8598 87.0 151.0
1 48.0 1.0 21.6 87.0 183.0 103.2 70.0 3.0 3.8918 69.0 75.0
2 72.0 0.0 30.5 93.0 156.0 93.6 41.0 4.0 4.6728 85.0 141.0
3 24.0 1.0 25.3 84.0 198.0 131.4 40.0 5.0 4.8903 89.0 206.0
4 50.0 1.0 23.0 101.0 192.0 125.4 52.0 4.0 4.2905 80.0 135.0
In [27]:
diabetes.tail()
Out[27]:
age sex bmi bp tc ldl hdl tch ltg glu target
437 60.0 0.0 28.2 112.00 185.0 113.8 42.0 4.00 4.9836 93.0 178.0
438 47.0 0.0 24.9 75.00 225.0 166.0 42.0 5.00 4.4427 102.0 104.0
439 60.0 0.0 24.9 99.67 162.0 106.6 43.0 3.77 4.1271 95.0 132.0
440 36.0 1.0 30.0 95.00 201.0 125.2 42.0 4.79 5.1299 85.0 220.0
441 36.0 1.0 19.6 71.00 250.0 133.2 97.0 3.00 4.5951 92.0 57.0

Describe the DataFrame¶

In [29]:
# Having a look at the general statistical summaries for the diabetes DataFrame
diabetes.describe()
Out[29]:
age sex bmi bp tc ldl hdl tch ltg glu target
count 442.000000 442.000000 420.000000 442.000000 420.000000 442.000000 442.000000 420.000000 442.000000 442.000000 442.000000
mean 48.518100 0.531674 26.358095 94.647014 188.830952 115.439140 49.788462 4.071595 4.641411 91.260181 152.133484
std 13.109028 0.499561 4.404820 13.831283 34.690827 30.413081 12.934202 1.296942 0.522391 11.496335 77.093005
min 19.000000 0.000000 18.000000 62.000000 97.000000 41.600000 22.000000 2.000000 3.258100 58.000000 25.000000
25% 38.250000 0.000000 23.175000 84.000000 164.000000 96.050000 40.250000 3.000000 4.276700 83.250000 87.000000
50% 50.000000 1.000000 25.700000 93.000000 186.000000 113.000000 48.000000 4.000000 4.620050 91.000000 140.500000
75% 59.000000 1.000000 29.325000 105.000000 209.000000 134.500000 57.750000 5.000000 4.997200 98.000000 211.500000
max 79.000000 1.000000 42.200000 133.000000 301.000000 242.400000 99.000000 9.090000 6.107000 124.000000 346.000000

Missing Values¶

In [31]:
# We know that the dataframe has missing values which can be verified below
diabetes.isna().max(axis=1).max()
Out[31]:
True
In [32]:
# To see the summary of missing values in each column
diabetes.isna().sum()
Out[32]:
age        0
sex        0
bmi       22
bp         0
tc        22
ldl        0
hdl        0
tch       22
ltg        0
glu        0
target     0
dtype: int64
In [33]:
# Visualizing the missing values in diabetes dataframe
msno.matrix(diabetes)
Out[33]:
<Axes: >
[Figure: missingno matrix showing missing values in the diabetes DataFrame]

It can be easily observed how the missing values occur over the three columns bmi, tc & tch. There are typically three approaches to dealing with missing values:

  • Dropping the observations with missing values
  • Filling the missing values with the mean
  • Filling the missing values with the median
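On a toy column (values made up), the three approaches look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})

dropped = df.dropna()                        # approach 1: drop rows with NaN
mean_filled = df.fillna(df["x"].mean())      # approach 2: fill with the mean (7/3)
median_filled = df.fillna(df["x"].median())  # approach 3: fill with the median (2.0)

print(len(dropped), mean_filled.loc[2, "x"], median_filled.loc[2, "x"])
```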

Dropping the observations with missing values¶

In [36]:
# Linear regression with dropping NaNs

# Getting the Non-NANs indices (observations/rows) of X_train and X_test
nonnan_train_indices = X_train.index[~X_train.isna().max(axis=1)]
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]

# Fit an instance of Linear Regression with train dataset
reg = LinearRegression().fit(X_train.loc[nonnan_train_indices], y_train.loc[nonnan_train_indices])

# Generate predictions for the test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])

# Finding the root mean squared error between prediction vs test target (y_test)
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
Out[36]:
55.962919546725054

Filling the observations with missing values with the mean¶

In [38]:
# Linear regression with mean fill

# Getting the Non-NAN indices (observations/rows) of X_test as only missing values in train dataset will be filled with mean 
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]

# Initializing simple imputer with 'mean' strategy. 
# Note: SimpleImputer supports mean, median, most_frequent and constant strategies
imp_mean = SimpleImputer(missing_values = np.nan, strategy = 'mean')

# Fit the simple imputer using the training data
imp_mean.fit(X_train)

# Transforming X_train to mean filled dataset and converting it to a pandas DataFrame
X_train_mean_fill = pd.DataFrame(imp_mean.transform(X_train))

# Assigning column names to the above dataframe
X_train_mean_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']

# Fit an instance of Linear Regression with mean filled train dataset
reg = LinearRegression().fit(X_train_mean_fill, y_train)

# Generate predictions for the test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])

# Finding the root mean squared error between prediction vs test target (y_test)
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
Out[38]:
55.95122410079265

Filling the observations with missing values with the median¶

In [40]:
# Linear regression with median fill

# Getting the Non-NAN indices (observations/rows) of X_test as only missing values in train dataset will be filled with median
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]

# Initializing simple imputer with 'median' strategy. 
# Note: SimpleImputer supports mean, median, most_frequent and constant strategies
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')

# Fit the simple imputer using the training data
imp_median.fit(X_train)

# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_fill = pd.DataFrame(imp_median.transform(X_train))

# Assigning column names to the above dataframe
X_train_median_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']

# Fit an instance of Linear Regression with median filled train dataset
reg = LinearRegression().fit(X_train_median_fill, y_train)

# Generate predictions for the test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])

# Finding the root mean squared error between prediction vs test target (y_test)
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
Out[40]:
55.9148764740674

The root mean squared error is lowest for linear regression with missing values filled with the median. Looking into ways to improve this.¶

Histograms and Boxplots¶

In [43]:
# Define a function that takes columns_toplt as an argument
def plot_hist_and_box(diabetes, columns_toplt):
    for idx, col in enumerate(columns_toplt): 
        # Creates two subplots (2 plots in a row)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (14, 6)) 
        
        # Creating a histogram in first subplot (ax1) with KDE overlay 
        sns.histplot(diabetes, x=diabetes[col], kde=True,
                     color=sns.color_palette('hls', len(columns_toplt))[idx], ax=ax1)
        
        # Creating a boxplot in second subplot (ax2) with the same color as histogram
        sns.boxplot(diabetes, x=diabetes[col], width=0.4, linewidth=3, fliersize=2.5, 
                    color=sns.color_palette('hls', len(columns_toplt))[idx], ax=ax2)
        
        # Adding title to the figure
        fig.suptitle(f"Histogram and Boxplot of {col}", size=20, y=1.02)
        plt.show()
In [44]:
# Collecting the names of all columns in the diabetes DataFrame except 'sex'
columns_toplt = [i for i in diabetes.columns if i != 'sex']

# Call the function
plot_hist_and_box(diabetes, columns_toplt)
[Figures: histogram and boxplot for each of the plotted columns]
In [45]:
# looking closely into hdl column
# Assigning column name 'hdl'
columns_toplt = ['hdl']

# Call the function
plot_hist_and_box(diabetes, columns_toplt)
[Figure: histogram and boxplot of 'hdl']

Normality Test on 'hdl'¶

In [47]:
# Normality test on 'hdl'
stat, p = shapiro(X_train['hdl'])
print('Statistics = %.3f, p = %.3f' % (stat, p))

# Interpret
alpha = 0.05
if p > alpha:
    print("Sample is normally distributed (fail to reject null hypothesis)")
else:
    print("Sample is not normally distributed (reject null hypothesis)")
Statistics = 0.962, p = 0.000
Sample is not normally distributed (reject null hypothesis)

Normality Test on log of 'hdl'¶

In [49]:
# Normality test on the log of 'hdl'
stat, p = shapiro(np.log(X_train['hdl']))
print('Statistics = %.3f, p = %.3f' % (stat, p))

# Interpret
alpha = 0.05
if p > alpha:
    print("Sample is normally distributed (fail to reject null hypothesis)")
else:
    print("Sample is not normally distributed (reject null hypothesis)")
Statistics = 0.996, p = 0.700
Sample is normally distributed (fail to reject null hypothesis)

Linear Regression with missing observations filled with median and log of 'hdl'¶

In [51]:
# Replacing 'hdl' column in X_train and X_test with log of 'hdl'
X_train['hdl'] = np.log(X_train['hdl'])
X_test['hdl'] = np.log(X_test['hdl'])

# Getting Non-NAN index values for X_test
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]

# Initializing simple imputer with 'median' strategy
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')

# Fit simple imputer with training data (with log 'hdl')
imp_median.fit(X_train)

# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_log_fill = pd.DataFrame(imp_median.transform(X_train))

# Assigning column names to the above dataframe
X_train_median_log_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']

# Fit an instance of Linear Regression
reg = LinearRegression().fit(X_train_median_log_fill, y_train)

# Generate prediction for X_test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])

# Calculate Root Mean Squared error 
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
Out[51]:
55.685863090763554

Root mean squared error has improved. Looking into column 'ldl' for more improvements¶

In [53]:
# looking closely into ldl column
# Assigning column name 'ldl'
columns_toplt = ['ldl']

# Call the function
plot_hist_and_box(diabetes, columns_toplt)
[Figure: histogram and boxplot of 'ldl']

Linear Regression with missing observations filled with median, log of 'hdl' and removal of outliers in 'ldl'¶

In [55]:
# Removing outlier rows from 'ldl'
X_train_nonoutlier_idx = X_train.index[X_train.ldl < X_train.ldl.quantile(0.999)]
X_train = X_train.loc[X_train_nonoutlier_idx]
y_train = y_train.loc[X_train_nonoutlier_idx]

# Getting Non-NAN index values for X_test
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]

# Initializing simple imputer with 'median' strategy
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')

# Fit simple imputer with training data (with log 'hdl')
imp_median.fit(X_train)

# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_log_fill = pd.DataFrame(imp_median.transform(X_train))

# Assigning column names to the above dataframe
X_train_median_log_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']

# Fit an instance of Linear Regression
reg = LinearRegression().fit(X_train_median_log_fill, y_train)

# Generate prediction for X_test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])

# Calculate Root Mean Squared error 
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
Out[55]:
55.53368308287885

Correlation Matrix¶

In [57]:
plt.figure(figsize = (12, 8))
sns.heatmap(diabetes.corr(), annot = True, cmap = 'Spectral', linewidth = 2, linecolor = '#000000', fmt = '.3f')
plt.show()
[Figure: annotated correlation heatmap of the diabetes DataFrame]

It can be observed that the correlation of 'tc' and 'ldl' with 'target' is very low. Hence we might be able to improve the regression model by dropping the column 'tc'.

Linear Regression with median fill, log of 'hdl', removed outliers in 'ldl' and 'tc' dropped¶

In [60]:
# Removing outlier rows from 'ldl'
X_train_nonoutlier_idx = X_train.index[X_train.ldl < X_train.ldl.quantile(0.999)]
X_train = X_train.loc[X_train_nonoutlier_idx]
y_train = y_train.loc[X_train_nonoutlier_idx]

# Getting Non-NAN index values for X_test
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]

# Getting column names except 'tc'
col_no_tc = [i for i in X_train.columns if i != 'tc']

# Initializing simple imputer with 'median' strategy
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')

# Fit simple imputer with training data (with log 'hdl')
imp_median.fit(X_train.loc[:, col_no_tc])

# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_log_fill = pd.DataFrame(imp_median.transform(X_train.loc[:, col_no_tc]))

# Assigning column names to the above dataframe
X_train_median_log_fill.columns= ['age', 'sex', 'bmi', 'bp', 'ldl', 'hdl', 'tch', 'ltg', 'glu']

# Fit an instance of Linear Regression
reg = LinearRegression().fit(X_train_median_log_fill, y_train)

# Generate prediction for X_test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices, col_no_tc])

# Calculate Root Mean Squared error 
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
Out[60]:
55.619929987460424

Removal of the 'tc' column has led to worse performance.

Pair Plots¶

In [63]:
sns.pairplot(diabetes)
plt.show()
[Figure: pairplot of the diabetes DataFrame]

A Simple function to perform EDA - fasteda¶

The fast_eda function from the fasteda package performs all of the above EDA steps in a single call.

In [66]:
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')

# Now run fast_eda(diabetes) function
fast_eda(diabetes)
DataFrame Head:
age sex bmi bp tc ldl hdl tch ltg glu target
0 59.0 0.0 32.1 101.0 157.0 93.2 38.0 4.0 4.8598 87.0 151.0
1 48.0 1.0 21.6 87.0 183.0 103.2 70.0 3.0 3.8918 69.0 75.0
2 72.0 0.0 30.5 93.0 156.0 93.6 41.0 4.0 4.6728 85.0 141.0
DataFrame Tail:
age sex bmi bp tc ldl hdl tch ltg glu target
439 60.0 0.0 24.9 99.67 162.0 106.6 43.0 3.77 4.1271 95.0 132.0
440 36.0 1.0 30.0 95.00 201.0 125.2 42.0 4.79 5.1299 85.0 220.0
441 36.0 1.0 19.6 71.00 250.0 133.2 97.0 3.00 4.5951 92.0 57.0
----------------------------------------------------------------------------------------------------
Missing values:
     0
bmi 22
tc 22
tch 22
----------------------------------------------------------------------------------------------------
MSNO Matrix:

[Figure: MSNO matrix of the diabetes DataFrame]
----------------------------------------------------------------------------------------------------
Shape of DataFrame:

(442, 11)

----------------------------------------------------------------------------------------------------
DataFrame Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     420 non-null    float64
 3   bp      442 non-null    float64
 4   tc      420 non-null    float64
 5   ldl     442 non-null    float64
 6   hdl     442 non-null    float64
 7   tch     420 non-null    float64
 8   ltg     442 non-null    float64
 9   glu     442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB
----------------------------------------------------------------------------------------------------
Describe DataFrame:

     count mean median std min 25% 50% 75% max skewness kurtosis
age 442 48.518 50 13.109 19 38.25 50 59 79 -0.219726 -0.714041
sex 442 0.532 1 0.5 0 0 1 1 1 -0.085793 -1.992640
bmi 420 26.358 25.7 4.405 18 23.175 25.7 29.325 42.2 0.582185 0.059985
bp 442 94.647 93 13.831 62 84 93 105 133 0.271068 -0.531885
tc 420 188.831 186 34.691 97 164 186 209 301 0.383226 0.226904
ldl 442 115.439 113 30.413 41.6 96.05 113 134.5 242.4 0.430437 0.538215
hdl 442 49.788 48 12.934 22 40.25 48 57.75 99 0.790610 0.987366
tch 420 4.072 4 1.297 2 3 4 5 9.09 0.737344 0.444940
ltg 442 4.641 4.62 0.522 3.258 4.277 4.62 4.997 6.107 0.300617 -0.160402
glu 442 91.26 91 11.496 58 83.25 91 98 124 0.220172 0.253283
target 442 152.133 140.5 77.093 25 87 140.5 211.5 346 0.430462 -0.876956
----------------------------------------------------------------------------------------------------
DataFrame Correlation:

[Figure: correlation heatmap]
----------------------------------------------------------------------------------------------------
DataFrame Pairplot:

[Figure: pairplot]
----------------------------------------------------------------------------------------------------
Histogram(s) & Boxplot(s):

[Figures: histogram and boxplot for each column]
----------------------------------------------------------------------------------------------------
Countplot(s):

[Figure: countplot of the categorical column]

Classification¶

Import Iris Data Set¶

In [69]:
# Load the data set from sklearn
iris_sklearn = load_iris()

# Extract the data and target labels as a numpy array
iris_npy = np.concatenate([iris_sklearn['data'], np.atleast_2d(iris_sklearn['target']).T], axis=1)

# Define column names
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target']

# Convert the numpy array to a pandas dataframe with column names
iris = pd.DataFrame(iris_npy, columns=col_names)

# Print a description of the dataset
print(iris_sklearn['DESCR'])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

|details-start|
**References**
|details-split|

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments".  IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...

|details-end|

In [70]:
iris['target'].sample(5)
Out[70]:
121    2.0
16     0.0
19     0.0
23     0.0
8      0.0
Name: target, dtype: float64
In [71]:
class_names = dict(zip(list(map(float, range(len(iris_sklearn['target_names'])))), iris_sklearn['target_names']))
print(class_names)
{0.0: 'setosa', 1.0: 'versicolor', 2.0: 'virginica'}
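The same mapping can also be used to label the rows directly. This sketch reloads the data with `as_frame=True` (where the target is integer-coded, unlike the float-coded frame built above) and adds a hypothetical `species` column:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
iris_df = data.frame.copy()

# Map the numeric target codes to their species names
names = dict(enumerate(data.target_names))
iris_df['species'] = iris_df['target'].map(names)

print(iris_df['species'].unique().tolist())  # ['setosa', 'versicolor', 'virginica']
```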

Performing EDA for classification using fasteda¶

In [73]:
fast_eda(iris, target = 'target')
DataFrame Head:
sepal_length sepal_width petal_length petal_width target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
DataFrame Tail:
sepal_length sepal_width petal_length petal_width target
147 6.5 3.0 5.2 2.0 2.0
148 6.2 3.4 5.4 2.3 2.0
149 5.9 3.0 5.1 1.8 2.0
----------------------------------------------------------------------------------------------------
Missing values:
     0
----------------------------------------------------------------------------------------------------
Shape of DataFrame:

(150, 5)

----------------------------------------------------------------------------------------------------
DataFrame Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   target        150 non-null    float64
dtypes: float64(5)
memory usage: 6.0 KB
----------------------------------------------------------------------------------------------------
Describe DataFrame:

     count mean median std min 25% 50% 75% max skewness kurtosis
sepal_length 150 5.843 5.8 0.828 4.3 5.1 5.8 6.4 7.9 0.311753 -0.573568
sepal_width 150 3.057 3 0.436 2 2.8 3 3.3 4.4 0.315767 0.180976
petal_length 150 3.758 4.35 1.765 1 1.6 4.35 5.1 6.9 -0.272128 -1.395536
petal_width 150 1.199 1.3 0.762 0.1 0.3 1.3 1.8 2.5 -0.101934 -1.336067
target 150 1 1 0.819 0 0 1 2 2 0.000000 -1.500000
----------------------------------------------------------------------------------------------------
DataFrame Correlation:

[Figure: correlation heatmap]
----------------------------------------------------------------------------------------------------
DataFrame Pairplot:

[Figure: pairplot]
----------------------------------------------------------------------------------------------------
Histogram(s) & Boxplot(s):

[Figures: histogram and boxplot for each column]
----------------------------------------------------------------------------------------------------
Countplot(s):

[Figure: countplot of 'target']
In [74]:
plt.axis('equal')
sns.scatterplot(iris, x='petal_width', y='sepal_width', hue='target', palette=sns.color_palette("hls", iris['target'].nunique()))
plt.show()
[Figure: scatterplot of petal_width vs sepal_width, coloured by target]
In [75]:
# Define a function to format value counts into percentages
def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = int(round(pct * total / 100.0))
        return '{:.1f}%\n({v:d})'.format(pct, v=val)
    return my_format

# Get value counts
vc = iris['target'].value_counts()

# Draw a pie chart using value counts and the `autopct_format` format
_ = plt.pie(vc, labels = vc.rename(class_names).index, autopct=autopct_format(vc))
[Figure: pie chart of the class distribution]